In this project, we analyse how earnings differ between men and women in the US. The data was collected in 2019, so results may have changed since then.
Questions
Some of the questions I will be trying to answer in this report are:
- Is there really a gender gap in earnings and in the number of women in certain positions?
- Is the gender gap the same across industries or positions?
- Are some age groups more affected?
- Is there a difference between full-time and part-time work?
Loading the data
We load three datasets that all include important information used in this analysis. Together they give a fuller picture, but they cannot be combined since each shows a different aspect of the same situation. The only thing they have in common is the timeline, and that is not a good key for merging.
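The loading step itself is not shown in the report, so here is a minimal sketch of how it could look with `data.table::fread`. The helper name `load_gender_data` and the three file names are assumptions for illustration only; the object names match the ones used later in the report.

```r
library(data.table)

# Hypothetical helper: reads the three tables from a local folder.
# The file names are assumptions; replace them with the actual files.
load_gender_data <- function(data_dir) {
  list(
    obs_gender      = fread(file.path(data_dir, "jobs_gender.csv")),
    earnings_female = fread(file.path(data_dir, "earnings_female.csv")),
    employed_gender = fread(file.path(data_dir, "employed_gender.csv"))
  )
}
```

Keeping the three tables in a named list reflects the point above: they are used side by side rather than merged.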
Data understanding
For the data understanding, we will be focusing on the first table comparing wages across positions for both genders. This will be the main table used for analysis and the others will be used or merged as needed.
| year | occupation | major_category | minor_category |
|---|---|---|---|
| Min. :2013 | Length:2088 | Length:2088 | Length:2088 |
| 1st Qu.:2014 | Class :character | Class :character | Class :character |
| Median :2014 | Mode :character | Mode :character | Mode :character |
| Mean :2014 | NA | NA | NA |
| 3rd Qu.:2015 | NA | NA | NA |
| Max. :2016 | NA | NA | NA |
| total_workers | workers_male | workers_female | percent_female |
|---|---|---|---|
| Min. : 658 | Min. : 0 | Min. : 0 | Min. : 0.00 |
| 1st Qu.: 18687 | 1st Qu.: 10765 | 1st Qu.: 2364 | 1st Qu.: 10.73 |
| Median : 58997 | Median : 32302 | Median : 15238 | Median : 32.40 |
| Mean : 196055 | Mean : 111515 | Mean : 84540 | Mean : 36.00 |
| 3rd Qu.: 187415 | 3rd Qu.: 102644 | 3rd Qu.: 63326 | 3rd Qu.: 57.31 |
| Max. :3758629 | Max. :2570385 | Max. :2290818 | Max. :100.00 |
| total_earnings | total_earnings_male | total_earnings_female |
|---|---|---|
| Min. : 17266 | Min. : 12147 | Min. : 7447 |
| 1st Qu.: 32410 | 1st Qu.: 35702 | 1st Qu.: 28872 |
| Median : 44437 | Median : 46825 | Median : 40191 |
| Mean : 49762 | Mean : 53138 | Mean : 44681 |
| 3rd Qu.: 61012 | 3rd Qu.: 65015 | 3rd Qu.: 54813 |
| Max. :201542 | Max. :231420 | Max. :166388 |
| NA | NA’s :4 | NA’s :65 |
| wage_percent_of_male |
|---|
| Min. : 50.88 |
| 1st Qu.: 77.56 |
| Median : 85.16 |
| Mean : 84.03 |
| 3rd Qu.: 90.62 |
| Max. :117.40 |
| NA’s :846 |
The variables look roughly linear or normally distributed, so we can use them for analysis. The only fields that look correlated are those that were calculated from each other or that measure the same thing, such as the number of workers or earnings.
This plot does make it look like women earn less than men at any given level of male earnings. However, since it plots everyone across all jobs, it may be too general, and we might be comparing apples to oranges. Therefore, we should look at it based on positions.
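A position-level comparison could be sketched as follows. The toy table and its values are made up purely for illustration; only the column names `occupation`, `total_earnings_male` and `total_earnings_female` come from the summary above, and the derived columns `female_pct_of_male` and `earnings_gap` are hypothetical names.

```r
library(data.table)

# Toy rows standing in for obs_gender; the values are made up for illustration.
toy <- data.table(
  occupation            = c("Occupation A", "Occupation B", "Occupation C"),
  total_earnings_male   = c(60000, 40000, NA),
  total_earnings_female = c(48000, 38000, 30000)
)

# Keep only rows where both earnings are reported, then compare within each row.
by_occupation <- toy[!is.na(total_earnings_male) & !is.na(total_earnings_female)]
by_occupation[, female_pct_of_male := 100 * total_earnings_female / total_earnings_male]
by_occupation[, earnings_gap := total_earnings_male - total_earnings_female]

print(by_occupation[order(female_pct_of_male)])
```

Comparing men and women within the same occupation avoids mixing high-paid and low-paid jobs in one cloud of points.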
This data does include some time series-like aspect with the year variable, so that should be checked.
plot(x = obs_gender$year, y = obs_gender$total_workers)
plot(x = obs_gender$year, y = obs_gender$total_earnings)
Looking at both total earnings and the total number of workers, there does not seem to have been much change between 2013 and 2016. Therefore, the year can be disregarded when making comparisons.
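The scatter plots show every occupation in every year, which can hide a small shift in the overall level. One way to double-check the "no change" reading is to collapse the data to one row per year. This is a sketch on made-up toy values, not the report's actual numbers; the worker-weighted mean is my choice here, not something the report uses.

```r
library(data.table)

# Toy stand-in for obs_gender with two occupations per year (made-up values).
toy <- data.table(
  year           = c(2013, 2013, 2014, 2014),
  total_workers  = c(1000, 3000, 1200, 2800),
  total_earnings = c(40000, 60000, 42000, 62000)
)

# One row per year: total head count and worker-weighted mean earnings.
yearly <- toy[, .(
  total_workers = sum(total_workers),
  mean_earnings = weighted.mean(total_earnings, total_workers)
), by = year]

print(yearly)
```

If the yearly rows barely differ on the real data, dropping the year from later comparisons is easier to justify.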
Number of women and their income in different industries
female <- obs_gender[, list(total_workers = sum(total_workers),
                            percent_female = mean(percent_female),
                            total_earnings = mean(total_earnings),
                            wage_percent_of_male = mean(wage_percent_of_male, na.rm = TRUE)),
                     by = 'minor_category']
theme_custom <- theme(
  text = element_text(family = "Palatino", size = 12),
  legend.position = 'bottom',
  plot.background = element_rect(color = "black", size = 1)
)
# The values are already aggregated, so geom_col() draws them as bars directly.
ggplot(female, aes(x = minor_category, y = total_workers)) +
  geom_col() +
  theme_bw() + theme_custom + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Number of total workers per industry', y = 'Total number of workers')
ggplot(female, aes(x = minor_category, y = percent_female)) +
  geom_col() +
  theme_bw() + theme_custom + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Percentage of female workers per industry', y = 'Percentage of female workers')
There are clearly industries with far fewer women, such as Construction or Installation. These are usually considered male professions, so a share of less than 10% women is expected, although hopefully more women will enter these professions in the future if they want to. There are also industries such as Healthcare and Education, usually considered female professions, where the percentage of women is well over 70%.
ggplot(female, aes(x = minor_category, y = total_earnings)) +
  geom_col() +
  theme_bw() + theme_custom + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Yearly earnings per industry', y = 'Salary ($)')
ggplot(female, aes(x = minor_category, y = wage_percent_of_male)) +
  geom_col() +
  theme_bw() + theme_custom + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = 'Female wage as a percentage of male wage per industry', y = 'Female wage as % of male wage')
Interestingly, while pay varies significantly across industries, women's wages as a percentage of men's stay relatively stable at around 70%. This suggests a systematic issue rather than something that only comes up in some industries or workplaces.
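One caveat of the aggregation used for these plots: `mean(percent_female)` gives every occupation the same weight, regardless of how many people work in it. A hedged alternative, assuming `workers_female` and `total_workers` are head counts as the summary suggests, is to compute the share from the summed counts. The toy values below are made up to exaggerate the difference.

```r
library(data.table)

# Toy stand-in for obs_gender: one small and one large occupation (made-up values).
toy <- data.table(
  minor_category = c("Industry X", "Industry X"),
  total_workers  = c(1000, 9000),
  workers_female = c(900, 900),
  percent_female = c(90, 10)
)

# Compare the unweighted mean of percentages with the share based on head counts.
industry <- toy[, .(
  simple_mean_pct   = mean(percent_female),
  weighted_mean_pct = 100 * sum(workers_female) / sum(total_workers)
), by = minor_category]

print(industry)
```

Which version is appropriate depends on the question: the simple mean describes a typical occupation, while the weighted share describes a typical worker.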
Change in wage percentage over the past 30 years
## Classes 'data.table' and 'data.frame': 264 obs. of 3 variables:
## $ Year : int 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 ...
## $ group : chr "Total, 16 years and older" "Total, 16 years and older" "Total, 16 years and older" "Total, 16 years and older" ...
## $ percent: num 62.3 64.2 64.4 65.7 66.5 67.6 68.1 69.5 69.8 70.2 ...
## - attr(*, ".internal.selfref")=<externalptr>
## [1] "Total, 16 years and older" "16-19 years"
## [3] "20-24 years" "25-34 years"
## [5] "35-44 years" "45-54 years"
## [7] "55-64 years" "65 years and older"
There is an additional Total group, which I will exclude since it aggregates all the other categories and therefore adds no new information.
wage_gap <- earnings_female[group != "Total, 16 years and older", ]
ggplot(wage_gap, aes(x = Year, y = percent, color = group)) +
  geom_point() +
  geom_smooth(method = 'lm', se = TRUE) +
  theme_bw() + theme_custom +
  transition_states(group) +
  labs(title = '{closest_state}', y = 'Percent of female wages of men (%)')
## `geom_smooth()` using formula 'y ~ x'
It is interesting to see that the younger the age group, the smaller the gap. One possible explanation is that older workers started their careers when the gap was larger and could only close it so far. It could also be related to the differences in type of work seen above.
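The comparison between age groups can also be made numerically by fitting the same linear trend that `geom_smooth(method = 'lm')` draws, once per group. This is a sketch on made-up toy values; the output column names `slope_per_year` and `fitted_last` are hypothetical.

```r
library(data.table)

# Toy stand-in for wage_gap with two age groups (made-up values).
toy <- data.table(
  Year    = rep(c(1980, 1990, 2000, 2010), times = 2),
  group   = rep(c("20-24 years", "55-64 years"), each = 4),
  percent = c(80, 85, 90, 95, 60, 64, 68, 72)
)

# Fit percent ~ Year separately per group; keep the slope and the last fitted value.
trends <- toy[, {
  fit <- lm(percent ~ Year)
  list(slope_per_year = unname(coef(fit)["Year"]),
       fitted_last    = unname(tail(fitted(fit), 1)))
}, by = group]

print(trends)
```

A table like this shows both the current level of each group and how fast it is changing, which the animated plot only shows one group at a time.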
The makeup of the workforce
ggplot(data = employed_gender) +
  geom_line(aes(x = year, y = full_time_female), color = 'maroon2') +
  geom_line(aes(x = year, y = full_time_male), color = 'blue') +
  theme_bw() + theme_custom +
  transition_reveal(year) +
  labs(title = 'Change in percentage of full-time work', y = "Percentage of full-time workers")
ggplot(data = employed_gender) +
  geom_line(aes(x = year, y = part_time_female), color = 'maroon2') +
  geom_line(aes(x = year, y = part_time_male), color = 'blue') +
  theme_bw() + theme_custom +
  transition_reveal(year) +
  labs(title = 'Change in percentage of part-time work', y = "Percentage of part-time workers")
The blue line represents men and the pink one women.
It is really interesting how gender affects the type of jobs that are most common. While the share has decreased over the past 30 years, men still hold full-time jobs in over 80% of cases. Women, on the other hand, have full-time jobs in only around 70-75% of cases, and about 25% have part-time jobs, compared with around 10-15% of men.
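The full-time difference can be made explicit as a gap in percentage points, and the table can be reshaped to long format so that gender becomes a variable rather than two hard-coded line colours. This is a sketch on made-up toy values; `full_time_gap`, `gender` and `percent_full_time` are hypothetical names.

```r
library(data.table)

# Toy stand-in for employed_gender (made-up values).
toy <- data.table(
  year             = c(1990, 2000, 2010),
  full_time_female = c(74, 75, 73),
  full_time_male   = c(90, 89, 87)
)

# Gap in percentage points between men and women working full-time.
toy[, full_time_gap := full_time_male - full_time_female]

# Long format: one row per year and gender, so colour can be mapped to gender
# and a plot gets a legend instead of hard-coded line colours.
long <- melt(toy, id.vars = "year",
             measure.vars = c("full_time_female", "full_time_male"),
             variable.name = "gender", value.name = "percent_full_time")

print(toy)
print(long)
```

With the long table, mapping `color = gender` inside `aes()` would let ggplot2 draw the legend itself, so the reader does not need a separate sentence explaining which line is which.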
Conclusion
Overall, it seems that the gender gap is real in many ways. Women are underrepresented in some positions, are paid less across all types of positions, and have been for the past 30 years. The gap does not close as they get older; it actually widens. Further, the share of women holding full-time jobs is about 10 percentage points lower than that of men. It is very important to have this data laid out and shown. While this analysis doesn’t prove any type of causation, it does show a correlation between gender and amount earned.